Scaling Laws for Discriminative Speech Recognition Rescoring Models
Recent studies have found that model performance has a smooth power-law
relationship, or scaling laws, with training data and model size, for a wide
range of problems. These scaling laws allow one to choose nearly optimal data
and model sizes. We study whether this scaling property is also applicable to
second-pass rescoring, which is an important component of speech recognition
systems. We focus on RescoreBERT as the rescoring model, which uses a
pre-trained Transformer-based architecture fine-tuned with an ASR
discriminative loss. Using such a rescoring model, we show that the word error
rate (WER) follows a scaling law for over two orders of magnitude as training
data and model size increase. In addition, we find that a pre-trained model
requires less data than a randomly initialized model of the same size,
representing the effective data transferred from the pre-training step. This
effective data transferred is also found to follow a scaling law with data and
model size.
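As a concrete illustration of the fitted form, the sketch below estimates a power law WER ≈ a · N^(-b) from (data size, WER) pairs via a log-log linear fit. The data points and variable names are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch: fitting a power-law scaling curve WER = a * N^(-b)
# to (training-set size, WER) measurements via a log-log linear fit.
# All numbers below are made-up placeholders.
import numpy as np

train_sizes = np.array([1e4, 1e5, 1e6, 1e7])  # training data size (assumed units)
wers = np.array([12.1, 9.0, 6.8, 5.1])        # measured WER (%) at each size

# In log space, WER = a * N^(-b) becomes a line: log(WER) = log(a) - b*log(N)
slope, intercept = np.polyfit(np.log(train_sizes), np.log(wers), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted scaling law: WER = {a:.2f} * N^(-{b:.3f})")

# Extrapolate to a larger, unseen data size
print("predicted WER at N=1e8:", a * 1e8 ** (-b))
```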
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Second pass rescoring is a critical component of competitive automatic speech
recognition (ASR) systems. Large language models have demonstrated their
ability to use pre-trained information for better rescoring of ASR
hypotheses. Discriminative training, which directly optimizes the minimum
word error rate (MWER) criterion, typically improves rescoring. In this study,
we propose and explore several discriminative fine-tuning schemes for
pre-trained LMs. We propose two architectures based on different pooling
strategies over output embeddings and compare them with probability-based MWER training. We
conduct detailed comparisons between pre-trained causal and bidirectional LMs
in discriminative settings. Experiments on LibriSpeech demonstrate that all
MWER training schemes are beneficial, giving additional gains of up to 8.5% WER.
The proposed pooling variants achieve lower latency while retaining most of the
improvements. Finally, our study concludes that bidirectionality is better
utilized with discriminative training.
Comment: ASRU 2023
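For readers unfamiliar with the MWER criterion, here is a minimal PyTorch sketch of one common probability-based formulation over an N-best list. The function name, shapes, and the mean-error baseline are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a probability-based MWER loss over an N-best list.
import torch
import torch.nn.functional as F

def mwer_loss(scores, word_errors):
    """scores: (batch, n_best) rescoring-model scores (higher = better).
    word_errors: (batch, n_best) edit-distance errors of each hypothesis."""
    probs = F.softmax(scores, dim=-1)                  # renormalize over N-best
    baseline = word_errors.mean(dim=-1, keepdim=True)  # average errors in list
    # Expected relative error: probability mass on worse-than-average
    # hypotheses increases the loss; mass on better ones decreases it.
    return (probs * (word_errors - baseline)).sum(dim=-1).mean()

# Toy usage: one utterance with a 4-best list.
scores = torch.tensor([[2.0, 1.5, 0.3, -1.0]], requires_grad=True)
errors = torch.tensor([[0.0, 1.0, 2.0, 5.0]])
loss = mwer_loss(scores, errors)
loss.backward()  # gradient favors low-error hypotheses
```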
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
We explore the ability of large language models (LLMs) to act as speech
recognition post-processors that perform rescoring and error correction. Our
first focus is on instruction prompting to let LLMs perform these tasks without
fine-tuning, for which we evaluate different prompting schemes, including zero- and
few-shot in-context learning and a novel task-activating prompting method that
combines causal instructions and demonstrations to increase the effective context window.
Next, we show that rescoring only by in-context learning with frozen LLMs
achieves results that are competitive with rescoring by domain-tuned LMs, using
a pretrained first-pass recognition system and rescoring output on two
out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with
fine-tuning we achieve error rates below the N-best oracle level, showcasing
the generalization power of the LLMs.
Comment: Accepted to IEEE Automatic Speech Recognition and Understanding
(ASRU) 2023. 8 pages. Second version, revised from the Sep 29th version.
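A hypothetical sketch of how such zero- and few-shot prompts might be assembled for N-best error correction is shown below. The function `build_correction_prompt`, the instruction wording, and the example hypotheses are all invented for illustration; the resulting string would be sent to any frozen LLM completion endpoint.

```python
# Hypothetical sketch: packing an instruction, optional in-context
# demonstrations, and an N-best list into a single correction prompt.
def build_correction_prompt(nbest, examples=()):
    lines = ["Instruction: Pick or rewrite the most likely transcript "
             "from the ASR hypotheses below."]
    for hyps, reference in examples:  # few-shot demonstrations
        lines.append("Hypotheses: " + " | ".join(hyps))
        lines.append("Answer: " + reference)
    lines.append("Hypotheses: " + " | ".join(nbest))
    lines.append("Answer:")
    return "\n".join(lines)

# Toy usage: one demonstration, then the query N-best list.
demos = [(["i red a book", "i rode a book"], "i read a book")]
nbest = ["show me flights to austin", "show me flights to boston"]
print(build_correction_prompt(nbest, demos))
# The LLM's completion after "Answer:" is taken as the corrected transcript.
```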
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10% relative to a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with only a marginal loss in generalization. Overall,
gazetteers were found to perform best, with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.
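To make the gazetteer idea concrete, here is a hypothetical sketch of the kind of feature such an approach could feed a rescorer: a per-token indicator of membership in the user's personalized entity list. The function and the token-level matching rule are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: tag hypothesis tokens that appear in a user's
# personalized entity list (a gazetteer), producing a binary feature the
# rescoring model could consume alongside the token sequence.
def gazetteer_tags(hypothesis, entities):
    entity_tokens = {tok for name in entities for tok in name.lower().split()}
    return [int(tok in entity_tokens) for tok in hypothesis.lower().split()]

contacts = {"Ana Sofia", "DJ Quest"}
print(gazetteer_tags("call ana sofia now", contacts))  # -> [0, 1, 1, 0]
```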
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
We propose a neural language modeling system based on low-rank adaptation
(LoRA) for speech recognition output rescoring. Although pretrained language
models (LMs) like BERT have shown superior performance in second-pass
rescoring, the high computational cost of scaling up the pretraining stage and of
adapting the pretrained models to specific domains limits their practical use in
rescoring. Here we present a method based on low-rank decomposition to train a
rescoring BERT model and adapt it to new domains using only a fraction (0.08%)
of the pretrained parameters. These inserted matrices are optimized through a
discriminative training objective along with a correlation-based regularization
loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is
evaluated on LibriSpeech and internal datasets, with training times decreased by
factors between 3.6 and 5.4.
Comment: Accepted to IEEE ASRU 2023. 8 pages.
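The low-rank decomposition itself is standard LoRA; below is a minimal PyTorch sketch of the idea (an assumption of mine, not the LoRB code). The pretrained weight stays frozen while a trainable low-rank update B·A of small rank r is added, so only a tiny fraction of parameters is optimized.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank
# update, y = W x + (alpha / r) * B (A x), with B initialized to zero so
# training starts from the pretrained behavior.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288, a small fraction of 768*768 + 768
```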
Analysis of gas-particle flows through multi-scale simulations
Multi-scale structures are inherent in gas-solid flows, which renders modeling efforts challenging. On one hand, detailed simulations, in which the fine structures are resolved and particle properties can be directly specified, can account for complex flow behaviors, but they are too computationally expensive to apply to larger systems. On the other hand, coarse-grained simulations demand far less computation, but they necessitate constitutive models that are often not readily available for given particle properties. The present study focuses on addressing this issue, as it seeks to provide a general framework through which one can obtain the required constitutive models from detailed simulations.
To demonstrate the viability of this general framework, in which closures can be proposed for different particle properties, we focus on the van der Waals force of interaction between particles. We start with Computational Fluid Dynamics (CFD) - Discrete Element Method (DEM) simulations, where the fine structures are resolved and the van der Waals force between particles can be directly specified, and obtain the closures for stress and drag that are required for coarse-grained simulations. Specifically, we develop a new cohesion model that appropriately accounts for the van der Waals force between particles in CFD-DEM simulations. We then validate this cohesion model and the CFD-DEM approach by showing that they qualitatively capture experimental results in which the addition of small particles to gas fluidization reduces bubble sizes. Based on the DEM and CFD-DEM simulation results, we propose stress models that account for the van der Waals force between particles. Finally, we apply machine learning, specifically neural networks, to obtain a drag model that captures the effects of fine structures and inter-particle cohesion. We show that this approach, which can readily be applied to closures other than drag, can take advantage of the large amount of data generated from simulations and therefore offers superior modeling performance over traditional approaches.
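The neural-network drag closure can be pictured as a small regression model mapping filtered local quantities to a correction on a homogeneous drag law. The sketch below is a hypothetical illustration of that idea; the feature choices (voidage, slip speed, a cohesion number), the network size, and the placeholder data are assumptions, not the study's actual inputs.

```python
# Hypothetical sketch: an MLP drag closure mapping filtered local flow
# features to a drag correction factor H = F_resolved / F_homogeneous.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),  # predicted drag correction factor
)

# Training data would come from filtering resolved CFD-DEM fields;
# here it is random placeholder data of the right shape.
features = torch.rand(1024, 3)   # (voidage, slip speed, cohesion number)
targets = torch.rand(1024, 1)    # filtered drag corrections from fine-grid data

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(100):             # a few fitting steps for illustration
    opt.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    opt.step()
```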